Abstract —Robust automated organ segmentation is a prerequisite
for computer-aided diagnosis (CAD), quantitative imaging
analysis, detection of pathologies and surgical assistance. For 
organs with high anatomical variability such as the pancreas, previous
segmentation approaches report low accuracies in comparison
to well-studied organs like the liver or heart. We present a
fully-automated bottom-up approach for pancreas segmentation 
in abdominal computed tomography (CT) scans. The method is 
based on a hierarchical cascade of information propagation by 
classifying image patches at different resolutions and cascading
superpixels (segments). There are four stages in the system: 1)
decomposing CT slice images into a set of disjoint boundary-preserving
superpixels; 2) computing pancreas class probability
maps via dense patch labeling; 3) classifying superpixels by
pooling both intensity and probability features to form empirical
statistics in cascaded random forest frameworks; and 4) simple
connectivity-based post-processing. The dense image patch
labeling is conducted by two schemes: an efficient random forest
classifier on image histogram, location and texture features; and
a more expensive (but more specific) deep convolutional
neural network classifier on larger image windows (i.e., with
more spatial context). 2D CT slices over-segmented by the
Simple Linear Iterative Clustering (SLIC) approach are adopted through
model/parameter calibration and labeled at the superpixel level
as positive (pancreas) or negative (non-pancreas background)
classes. The approach is evaluated on a database of 80
manually segmented CT volumes in six-fold cross-validation. Our
results are comparable to, or better than, the state-of-the-art
methods (evaluated by “leave-one-patient-out”), with a Dice
coefficient of 70.7% and Jaccard Index of 57.9%. In addition,
computational efficiency is drastically improved, to the
order of 6 to 8 minutes per testing case, compared with more than
10 hours for other methods. The segmentation framework using deep patch
labeling confidences is also more numerically stable, as reflected
by the smaller standard deviations of the performance metrics. Finally,
we implement a multi-atlas label fusion (MALF) approach for
pancreas segmentation using the same dataset. Under six-fold
cross-validation, our bottom-up segmentation method significantly
outperforms its MALF counterpart: 70.7 ± 13.0% versus
52.51 ± 20.84% in Dice coefficients.


I. Introduction

Image segmentation is a key step in image understanding 
that aims at separating objects within an image into classes, 
based on object characteristics and a priori information about
the surroundings. This also applies to medical image analysis 

Fig. 1. 3D manually segmented volumes of six pancreases from six
patients. Notice the shape and size variations.


in various imaging modalities. The segmentation of abdominal 
organs such as the spleen, liver and pancreas in abdominal 
computed tomography (CT) scans can be an important input to 
computer aided diagnosis (CAD) systems, for quantitative and 
qualitative analysis and for surgical assistance. For
quantitative imaging analysis of diabetic patients, a
critical step in developing such CAD systems is segmentation
specifically of the pancreas. Pancreas segmentation
is also a necessary input for subsequent methodologies for
pancreatic cancer detection. The literature is rich in methods
for automatic segmentation on CT with high accuracies (e.g.,
Dice coefficients > 90%) for other organs such as the kidneys,
lungs, heart and liver. Yet high accuracy in
automatic segmentation of the pancreas remains a challenge,
and the literature on it is not as abundant in either single- or
multi-organ segmentation setups.

The pancreas is highly variable anatomically in
shape and size, and its location within the abdominal
cavity shifts from patient to patient. The boundary contrast can
vary greatly with the amount of visceral fat in the proximity of
the pancreas. These and other factors make segmentation of
the pancreas very challenging. Fig. 1 depicts several manually
segmented 3D volumes of various patient pancreases to better
illustrate the variations and challenges mentioned. From the
above observations, we argue that the automated pancreas 
segmentation problem should be treated differently, apart from 
the current organ segmentation literature where statistical 
shape models are generally used. 

In this paper, a new fully bottom-up approach using image
and (deep) patch-level labeling confidences for pancreas
segmentation is proposed and evaluated on 80 single-phase CT patient
data volumes. The approach is motivated by the need to improve the
segmentation accuracy of highly deformable organs, like the
pancreas, by leveraging a mid-level representation of image
segments. First, an over-segmentation of all 2D slices of an input
patient abdominal CT scan is obtained as a semi-structured
representation known as superpixels. Second, superpixels are
classified into the two semantic classes of pancreas and non-pancreas
through a multi-stage feature extraction and
random forest (RF) classification process, on the image and
(deep) patch-level confidence maps, pooled at the superpixel
level. Two cascaded random forest superpixel classification
frameworks are presented and compared. Fig. 2 depicts the
overall proposed first framework. Fig. 10 illustrates the modularized
flow charts of both frameworks. Our experiments
are carried out in a six-fold cross-validation manner.
Our system processes a new testing case about two orders of
magnitude more efficiently than
the atlas registration based approaches.
The obtained results are comparable to, or better than,
the state-of-the-art methods (evaluated by “leave-one-patient-out”),
with a Dice coefficient of 70.7% and Jaccard Index of
57.9%. Under the same six-fold cross-validation, our bottom-up
segmentation method significantly outperforms its “multi-atlas
registration and joint label fusion” (MALF) counterpart
(based on our implementation): Dice coefficients
70.7 ± 13.0% versus 52.51 ± 20.84%. Additionally,
another bottom-up supervoxel based multi-organ segmentation
method without registration in 3D abdominal CT images has also
been investigated in a similar spirit, demonstrating this
methodological synergy.

II. Previous Work 

The organ segmentation literature can be divided into two 
broad categories: top-down and bottom-up approaches. In top-down
approaches, a priori knowledge such as atlas(es) and/or
shape models of the organ is generated and incorporated
into the framework via learning-based shape model fitting
or volumetric image registration.
In bottom-up approaches, segmentation is performed by
local image similarity grouping and growing, or by pixel,
superpixel/supervoxel based labeling, since a direct representation
of the organ is not incorporated. Generally speaking,
top-down methods are targeted at organs which can be
modeled well by statistical shape models, whereas
bottom-up representations are more effective for highly non-Gaussian
shaped or pathological organs.

Previous work on pancreas segmentation from CT images
has been dominated by top-down approaches which rely on
atlas-based approaches, statistical shape modeling, or both.
The method of Shimizu et al. is generally not comparable to the
others since it uses three-phase contrast-enhanced CT datasets.
The rest are performed on single-phase CT images.

• Shimizu et al. utilize three-phase contrast-enhanced
CT data which are first registered together for a particular
patient and then registered to a reference patient
by landmark-based deformable registration. The spatial
support area of the abdominal cavity is reduced by
segmenting the liver, spleen and three main vessels associated
with location interpretation of the pancreas (i.e.,
splenic, portal and superior mesenteric veins). Coarse-to-fine
pancreas segmentation is performed by using a generated
patient-specific probabilistic atlas to guide segmentation,
followed by intensity-based classification and post-processing.
Validation of the approach was conducted on
20 multi-phase datasets resulting in a Jaccard of 57.9%.

• Okada et al. perform multi-organ segmentation by
combining inter-organ spatial interrelations with probabilistic
atlases. The approach incorporates various kinds of a priori
knowledge into the model, including shape
representations of seven organs. Experimental validation
was conducted on 28 abdominal contrast-enhanced CT
datasets, obtaining an overall volume overlap of Dice
index 46.6% for the pancreas.

• Chu et al. present an automated multi-organ segmentation
method based on spatially-divided probabilistic
atlases. The algorithm consists of image-space division
and a multi-scale weighting scheme to deal with the large
differences among patients in organ shape and position in
local areas. Their experimental results show that the liver,
spleen, pancreas and kidneys can be segmented with Dice
similarity indices of 95.1%, 91.4%, 69.1%, and 90.1%,
respectively, using 100 annotated abdominal CT volumes.

• Wolz et al. may be considered the state-of-the-art
result thus far for single-phase pancreas segmentation.
Theirs is a multi-organ segmentation approach that
combines hierarchical weighted subject-specific atlas-based
registration and patch-based segmentation. Post-processing
takes the form of optimized graph-cuts with a
learned intensity model. Their Dice
overlap for the pancreas is 69.6% on 150 patients and
58.2% on a sub-population of 50 patients.

• Recent work by Wang et al. proposes a patch-based
label propagation approach that uses relative geodesic
distances. The approach can be considered a start towards
developing a bottom-up component for segmentation:
affine registration between the dataset and atlases is
conducted, followed by refinement using patch-based
segmentation to reduce misregistrations and handle instances of
high anatomical variability. The approach was evaluated on
100 abdominal CT scans with an overall Dice of 65.5%
for the pancreas segmentation.

The default experimental setting in many of the atlas-based
approaches is a “leave-one-patient-out”
or “leave-one-out” (LOO) criterion for up to
N=150 patients. In the clinical setting, leave-one-out based
dense volume registration (from all other N-1 patients as atlas
templates) and label fusion may be computationally
impractical (more than 10 hours per testing case). More importantly, it
does not scale up easily when large-scale datasets are present.

The proposed bottom-up approach is significantly more
efficient in memory and computation speed than the multi-atlas
registration framework. Quantitative


Fig. 2. Overall pancreas segmentation framework via dense image patch
labeling. (Input: portal-venous abdominal CT datasets.)


evaluation on 80 manually segmented CT patient volumes
under six-fold cross-validation (CV) is conducted. Our results
are comparable to, or better than, the state-of-the-art methods
(even under “leave-one-patient-out”, or LOO), with a Dice coefficient
of 70.7% and Jaccard Index of 57.9%. A strict numerical
performance comparison is not possible since the experiments cannot
be performed on the same dataset.^ We instead implement
a multi-atlas label fusion (MALF) pancreas segmentation
approach using our datasets. Under six-fold CV,
our bottom-up segmentation method clearly outperforms its
MALF counterpart: Dice coefficients 70.7 ± 13.0% versus
52.51 ± 20.84%.
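
For reference, the two overlap metrics quoted above can be computed from binary masks roughly as follows. This is a minimal sketch: the function name and the flat 0/1 list representation of a mask are illustrative assumptions, not the evaluation code used for the reported numbers.

```python
def dice_and_jaccard(pred, truth):
    """Return (Dice, Jaccard) for two equal-length binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - intersection
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention assumed here).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_size + truth_size)
    jaccard = intersection / union
    return dice, jaccard


# Toy example: 4 predicted voxels, 5 ground-truth voxels, 3 in common.
pred = [1, 1, 1, 1, 0, 0, 0, 0]
truth = [0, 1, 1, 1, 1, 1, 0, 0]
dice, jaccard = dice_and_jaccard(pred, truth)
```

For a single case the two metrics are related by J = D / (2 - D). Means taken over many cases, such as the 70.7% and 57.9% above, need not satisfy this identity exactly.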

Superpixel-based representation for pathological region detection
and segmentation has recently been studied using MRI
datasets. However, the problem representation, visual feature
extraction and classification framework there differ drastically
from our approach. Our bottom-up image parsing method,
especially its use of superpixels as an intermediate-level image
representation, is partially inspired by similar approaches in
PASCAL semantic segmentation and scene labeling.
The technical and algorithmic details are significantly
different from those works, especially in the sense that the
pancreas is a relatively small organ and the segmentation areas
or volumes of the foreground/background (i.e., pancreas versus
other) classes are highly unbalanced. The pancreas normally
occupies less than 1% of the space in an input CT scan. Here
we use a “Cascaded Superpixels” representation as a solution to
address this domain-specific challenge, which may widely exist
in medical image analysis.

In this paper, we generalize the algorithm framework proposed
in a preliminary version of this work. In
the second proposed framework (F-2) in Fig. 10, deep convolutional
neural network (CNN) based image patch labeling
(to leverage richer CNN image features) is integrated and
thoroughly evaluated (Sec. III-C and Sec. III-D are added as
new content). More importantly, the representation aspects of
different algorithm variations and experimental configurations,
the quantitative comparison to MALF and superpixel-based
CNNs [20] under 6- or 4-fold CV, and the technical
insights are extensively enriched to improve the completeness
of this manuscript.

^Our annotated pancreas segmentation datasets are publicly available at
https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT, to ease
future comparisons.


III. Methods

In this section, the components of our overall algorithm flow
(shown in Fig. 2) are first addressed (Sec. III-A and Sec. III-B).
The method extensions exploiting sliding-window CNN
based dense image patch labeling and the framework variations
are described in Sec. III-C and Sec. III-D.


A. Boundary-preserving over-segmentation 

Over-segmentation occurs when images (or more generally
grid graphs) are segmented or decomposed into smaller perceptually
meaningful regions, “superpixels”. Within a superpixel,
pixels carry similarities in color, texture, intensity, etc.,
and superpixels generally align with image edges rather than rectangular
patches (i.e., superpixels can be irregular in shape and size). In
the computer vision literature, numerous approaches have been
proposed for superpixel segmentation [22], [23], [24], [25],
[26]. Each approach has its drawbacks and advantages, but
three main properties are generally examined when deciding
on the appropriate method for an application, as discussed in
[23]: 1) adherence to image boundaries; 2) computational
speed, ease of use and memory efficiency, especially when
reducing computational complexity is important; and 3)
improvement in both quality and speed of the final segmentation
output.

Superpixel methods fall under two broad categories:
graph-based methods (e.g., SLIC [22], entropy rate [24] and efficient
graph-based [25]) and
gradient ascent methods (e.g., watershed [26] and mean shift
[27]). In terms of computational complexity, [25] and [26] are
relatively fast at $O(M \log M)$, where M is the
number of pixels or voxels in the image or grid graph. Mean
shift [27] and normalized cut [28] are $O(M^2)$ and $O(M^{3/2})$,
respectively. Simple linear iterative clustering (SLIC) [22] is
both fast and memory efficient. In our work, three
graph-based superpixel algorithms
(i.e., SLIC [22], [23], efficient graph-based [25] and entropy
rate [24]) and one gradient ascent method (i.e., watershed [26])
are evaluated and compared, considering the three criteria in [23]. Fig. 3
shows sample superpixel results using the SLIC approach: the
original CT slices and cropped, zoomed-in pancreas superpixel
regions. Boundary recall, a typical measurement
used in the literature, indicates how many “true”
edge pixels of the ground-truth object segmentation are within
a given pixel range of the superpixel boundaries (i.e., object-level edges
are recalled by superpixel boundaries). High boundary recall
indicates that few true edges were missed. Fig. 4 shows
sample quantitative results. Based on Fig. 4, high boundary

Fig. 4. Superpixel boundary recall results evaluated on 20 patient scans
(distance in millimeters). The watershed method [26] is shown in red, efficient
graph [25] in blue, while the SLIC [22] and the entropy rate [24] based
methods are depicted in cyan and green, respectively. The red line represents
the 90% marker.


Fig. 3. Sample superpixel generation results from the SLIC method [22]. First
column depicts different slices from different patient scans with the ground- 
truth pancreas segmentation in yellow (a, d & g). The second column depicts 
the over segmentation results with the pancreas contours superimposed on the 
image (b, e & h). Last, (c) (f) and (i) show zoomed-in areas of the pancreas 
superpixel results from (b) (e) and (h). 


recalls, within distances of 1 to 6 pixels from
the semantic pancreas ground-truth boundary annotation, are
obtained using the SLIC approach. The watershed approach
provided the least promising results for the pancreas,
because it lacks a mechanism to utilize boundary
information in conjunction with intensity information,
as implemented in graph-based approaches. The number of
superpixels per axial image is constrained to the range [100, 200] to
make a good trade-off on superpixel dimensions or sizes.
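
The boundary recall measure described above can be sketched as follows. This is an illustrative, brute-force version under stated assumptions: boundaries are taken as pixels whose right or lower neighbour has a different label, distances are Euclidean in pixels, and all function names are hypothetical. A practical implementation would more likely use a distance transform.

```python
def boundary_pixels(labels):
    """Pixels whose right or lower neighbour carries a different label."""
    rows, cols = len(labels), len(labels[0])
    edges = set()
    for y in range(rows):
        for x in range(cols):
            if x + 1 < cols and labels[y][x] != labels[y][x + 1]:
                edges.add((y, x))
            if y + 1 < rows and labels[y][x] != labels[y + 1][x]:
                edges.add((y, x))
    return edges


def boundary_recall(gt_mask, superpixel_labels, max_dist):
    """Fraction of ground-truth edge pixels within max_dist pixels of a superpixel edge."""
    gt_edges = boundary_pixels(gt_mask)
    sp_edges = boundary_pixels(superpixel_labels)
    if not gt_edges:
        return 1.0  # nothing to recall (convention assumed here)
    limit = max_dist * max_dist
    hits = sum(
        1
        for (gy, gx) in gt_edges
        if any((gy - sy) ** 2 + (gx - sx) ** 2 <= limit for (sy, sx) in sp_edges)
    )
    return hits / len(gt_edges)


# Toy 4x4 example: the superpixel edge sits one pixel left of the true edge.
gt_mask = [[0, 0, 1, 1]] * 4
superpixels = [[0, 1, 1, 1]] * 4
recall_at_0 = boundary_recall(gt_mask, superpixels, max_dist=0)
recall_at_1 = boundary_recall(gt_mask, superpixels, max_dist=1)
```

In the toy example no true edge pixel is recalled at distance 0, while all of them are recalled once a tolerance of one pixel is allowed.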

The overlapping ratio r of a superpixel versus the ground-truth
pancreas annotation mask is defined as the percentage
of pixels/voxels inside the superpixel that are annotated as
pancreas. By thresholding on r (if r > τ the superpixel
is labeled as pancreas and otherwise as background), we
obtain pancreas segmentation results. When τ = 0.50,
the achieved mean Dice coefficient is 81.2% ± 3.3%, which is
referred to as the “Oracle” segmentation accuracy since computing
r requires knowledge of the ground-truth segmentation.
This is also the upper bound on segmentation accuracy for our
superpixel labeling or classification framework. 81.2 ± 3.3% is
significantly higher and numerically more stable (in standard
deviation) than previous state-of-the-art methods,
leaving considerable room for improvement in our
work. Note that the choices of SLIC and τ = 0.50 are both
calibrated using a subset of 20 scans. We find there is no need
to evaluate different superpixel generation methods/parameters
and values of τ as “model selection” using the training folds in each
round of six-fold cross-validation; this superpixel calibration
procedure generalizes well to all our datasets. Voxel-level
pancreas segmentation can be propagated from superpixel-level
classification and further improved by efficient narrow-band
level-set based curve evolution [29], or the learned
intensity model based graph-cut [17].
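
The “Oracle” labeling described in this paragraph can be illustrated with a short sketch. It assumes a 2D superpixel label map and a binary ground-truth mask given as nested lists; the function names and the threshold argument `tau` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def overlap_ratios(superpixel_labels, gt_mask):
    """Map each superpixel id to r = (# pancreas pixels) / (# pixels) inside it."""
    total = defaultdict(int)
    positive = defaultdict(int)
    for label_row, mask_row in zip(superpixel_labels, gt_mask):
        for sp_id, is_pancreas in zip(label_row, mask_row):
            total[sp_id] += 1
            positive[sp_id] += 1 if is_pancreas else 0
    return {sp_id: positive[sp_id] / total[sp_id] for sp_id in total}


def oracle_segmentation(superpixel_labels, gt_mask, tau=0.5):
    """Label every pixel of a superpixel as pancreas when its ratio r exceeds tau."""
    ratios = overlap_ratios(superpixel_labels, gt_mask)
    return [[1 if ratios[sp_id] > tau else 0 for sp_id in row] for row in superpixel_labels]


# Toy 2x4 slice with three superpixels (ids 0, 1, 2).
superpixel_labels = [[0, 0, 1, 1],
                     [0, 0, 1, 2]]
gt_mask = [[0, 0, 1, 1],
           [0, 1, 1, 0]]
ratios = overlap_ratios(superpixel_labels, gt_mask)
oracle = oracle_segmentation(superpixel_labels, gt_mask, tau=0.5)
```

Here superpixel 0 contains one pancreas pixel out of four (r = 0.25) and is labeled background as a whole, which is why the oracle result is an upper bound rather than a perfect segmentation.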

Mid-level visual representations like superpixels have been
widely adopted in recent PASCAL VOC semantic segmentation
challenges from computer vision. Generally speaking, a
diverse set of plausible segments is expected per image, to be
fully exploited by later probabilistic models.
In particular, the CPMC (Constrained Parametric Min-Cuts) based
image segment generation process has achieved state-of-the-art
results for 20 object categories. Segments are
extracted around regularly placed foreground seeds against
varying background seeds (with respect to image boundary
edges) using all levels of foreground bias. CPMC has the
effect of producing diverse segmentations at various locations
and spatial scales; however, it is computationally expensive (on
the order of 10 minutes for a natural image of 256x256 pixels,
whereas our CT slices have a resolution of 512x512). For
many patients in our study (especially those with low body fat indexes),
the contrast of pancreas boundaries can
be much weaker than in PASCAL images (where the challenge
lies more in the visually cluttered nature of multi-class objects).
CPMC may therefore not be effective at preserving pancreas
boundaries.


Extension of superpixels to supervoxels is possible, but in
this work the 2D superpixel representation is preferred,
due to the potential boundary leakage problem of supervoxels,
which may severely deteriorate the pancreas segmentation across
multiple CT slices. Object-level pancreas boundaries in abdominal
CT images can appear very weak, especially for
patients with less body fat. On the other hand, as in Sec.
III-B, image patch based appearance descriptors and statistical
models are well developed for 2D images (in both computer
vision and medical imaging). Using 3D statistical texture
descriptors and models to accommodate supervoxels can be
computationally expensive, without obvious performance
improvement.

In Sec. III-B and Sec. III-C next, we describe two
different methods for dense image patch labeling using
histogram/texture features and a deep Convolutional Neural
Network.

Fig. 6. Examples of a computed KDE histogram (a) and a sample 2D probability
response map (c) of the original CT in (b). Red represents higher probability;
blue represents lower probability.




Fig. 5. Sample slice with center positions superimposed as green dots. The
25 × 25 image patch and corresponding D-SIFT descriptors are shown to the
right of the original image.


B. Patch-level Visual Feature Extraction and Classification: $P^{RF}$

Feature extraction is a form of object representation that 
aims at capturing the important shape, texture and other salient 
features that allow distinctions between the desired object (i.e.,
pancreas) and its surroundings to be made. In this work, a total
of 46 patch-level image features depicting the pancreas and
its surroundings are implemented. The overall 3D abdominal
body region per patient is first segmented and identified using 
a standard table-removal procedure where all voxels outside 
the body are removed. 

1) To describe the texture information, we adopt the
Dense Scale-Invariant Feature Transform (dSIFT) approach,
which is derived from the SIFT descriptor with
several technical extensions. The publicly available VLFeat
implementation of dSIFT is employed. Fig. 5 depicts
the process on a sample image slice. The descriptors
are densely and uniformly extracted from image grids
with inter-distances of 3 pixels. The patch center positions are
shown as the green points superimposed on the original image
slice. Once the positions are known, the dSIFT is computed
with the geometry of [1x2] bins and bin size of 6 pixels,
which results in a 32-dimensional texture descriptor for each
image patch. The image patch size in this work is fixed at
25 × 25 pixels, which is a trade-off between computational efficiency
and descriptive power. The image patch
size was evaluated empirically over the range of 15 to 35 pixels using a
small sub-sampled dataset for classification, as described later.
Stable performance statistics are observed, and quantitative
experimental results using the default patch size of 25 × 25
pixels are reported.

2) A second feature group using the voxel intensity histograms
of the ground-truth pancreas and the surrounding
CT scans is built in the class-conditional probability density
function (PDF) space. A kernel density estimator (KDE)^
is created using the voxel intensities from a subset of randomly
selected patient CT scans. The KDE represents the
CT intensity distributions of the positive class $\{X^+\}$ and negative
class $\{X^-\}$ of pancreas and non-pancreas voxels.
All voxels containing pancreas information are
considered in the positive sample set; since negative voxels
far outnumber the positive ones, only 5% of the total number from
each CT scan (by random resampling) is considered. Let
$\{X^+\} = (h_1^+, h_2^+, \ldots, h_n^+)$ and $\{X^-\} = (h_1^-, h_2^-, \ldots, h_m^-)$,
where $h_n^+$ and $h_m^-$ represent the intensity values for the positive
and negative pixel samples for all 26 patient CT scans over the
entire abdominal CT Hounsfield range. The kernel density estimators
$f^+(X^+) = \frac{1}{n} \sum_{j=1}^{n} K(X^+ - X_j^+)$ and
$f^-(X^-) = \frac{1}{m} \sum_{j=1}^{m} K(X^- - X_j^-)$ are computed, where $K(\cdot)$ is assumed
to be a Gaussian kernel with optimal computed bandwidth,
for this data, of 3.039. Kernel sizes or bandwidths may be
selected automatically using a 1D likelihood-based search, as
provided by the KDE toolkit used. The normalized likelihood
ratio is calculated, which becomes a probability value as a
function of intensity in the range $H = [0 : 1 : 4095]$. Thus,
the probability of being considered pancreas is formulated
as $\frac{f^+(X^+)}{f^+(X^+) + f^-(X^-)}$.
This function is converted into
a pre-computed look-up table over $H = [0 : 1 : 4095]$,
which allows very efficient $O(1)$ access time. Fig. 6 depicts
the normalized KDE histogram and a sample 2D
probability response result on a CT slice. High probability
regions are red in color and low probabilities are blue. In the
original CT intensity domain, the mean, median and standard
deviation (std) statistics over the full 25 × 25 pixel range per
image patch, P, are extracted.

^ http://www.ics.uci.edu/~ihler/code/kde.html

3) Utilizing the KDE probability response maps above
and the superpixel CT masks described in Sec. III-A as
underlying supporting masks for each image patch, the same
KDE response statistics within the intersected sub-regions, P’
of P, are extracted. The idea is that an image patch, P, may be
divided among more than one superpixel. This set of statistics is
calculated with respect to the most representative superpixel
(the one that covers the patch center pixel). In this manner, object
boundary-preserving intensity features are obtained [35].

4) The final two features for each axial slice (in the
patient volumes) are the normalized relative x-axis and y-axis
positions ∈ [0, 1], computed at each image patch center against
the segmented body region (self-normalized^ to patients with
different body masses to some extent). Once all of the features
are concatenated together, a total of 46 image patch-level
features per superpixel are used to train a random forest (RF)
classifier $C_P$. Image patch labels are obtained by directly
borrowing the class information of their patch center pixels,
based on the manual segmentation.

^The axial reconstruction CT scans in our study have largely varying
ranges or extents in the z-axis. If some anatomical landmarks, such as the
bottom plane of the liver or the center of the kidneys, can be provided automatically,
anatomically normalized z-coordinate positions for superpixels can be
computed and used as an additional spatial feature for RF classification.
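
The intensity feature of item 2) above, a Gaussian KDE likelihood ratio precomputed as a look-up table, can be sketched as follows. This is a minimal pure-Python illustration, not the KDE toolkit used in the paper. It assumes intensities are already mapped to integer levels 0..4095, and the function names, the toy sample values and the 0.5 fallback are assumptions made for this example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return f(x) = (1/n) * sum_j K(x - x_j) with a Gaussian kernel of given bandwidth."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def build_lookup_table(pos_samples, neg_samples, bandwidth, n_levels=4096):
    """Precompute f+ / (f+ + f-) for every integer intensity level 0..n_levels-1."""
    f_pos = gaussian_kde(pos_samples, bandwidth)
    f_neg = gaussian_kde(neg_samples, bandwidth)
    table = []
    for level in range(n_levels):
        p, q = f_pos(level), f_neg(level)
        # Where both densities underflow to zero, fall back to 0.5 (an assumption).
        table.append(p / (p + q) if p + q > 0.0 else 0.5)
    return table


# Toy data (assumed values): pancreas-like intensities cluster near level 1105.
pos_samples = [1100, 1105, 1110]
neg_samples = [900, 950, 1300]
lut = build_lookup_table(pos_samples, neg_samples, bandwidth=3.039)

# O(1) lookup of the pancreas probability for a voxel intensity.
prob_pancreas_like = lut[1105]
prob_background_like = lut[950]
```

Once the table is built, converting a whole CT slice into a probability response map is a single indexed lookup per pixel.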

To resolve the data imbalance between the positive
(pancreas) and negative (non-pancreas) class patches (during
RF training), the sample weights for the two classes are normalized
so that the sum of the sample weights for each class
reaches the same constant (i.e., assuming a uniform prior). This
means that, when calculating the empirical two-class probabilities
at any RF leaf node, the positive and negative samples (reaching
that node) contribute with different weights. Pancreas class
patch samples are weighted much higher than non-pancreas
background class instances since they are much rarer. A similar
sample reweighting scheme is also exploited in the regression
forest for anatomy localization [36].
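
The reweighting idea can be made concrete with a small sketch. It assumes the per-class constant is 1.0 and uses hypothetical function names. It shows only the weight normalization and the resulting weighted leaf probability, not a full random forest.

```python
def balanced_sample_weights(labels, class_total=1.0):
    """Give each sample a weight so that every class's weights sum to class_total."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return [class_total / counts[label] for label in labels]


def weighted_leaf_probability(leaf_labels, leaf_weights, positive=1):
    """Empirical positive-class probability at a leaf, using weighted counts."""
    total = sum(leaf_weights)
    pos = sum(w for lab, w in zip(leaf_labels, leaf_weights) if lab == positive)
    return pos / total


# Toy training set: 1 pancreas patch versus 9 background patches.
labels = [1] + [0] * 9
weights = balanced_sample_weights(labels)

# A leaf reached by the single positive sample and three negatives.
leaf_labels = [1, 0, 0, 0]
leaf_weights = [weights[0], weights[1], weights[2], weights[3]]
leaf_prob = weighted_leaf_probability(leaf_labels, leaf_weights)
```

Without reweighting this leaf would report a pancreas probability of 0.25. With class-balanced weights it reports 0.75, reflecting the uniform prior.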

Six-fold cross-validation for RF training is carried out.
Response maps are computed for the image patch-level classification
and dense labeling. Fig. 7 (d) and (h) show sample
illustrative slices from different patients. High probability
corresponding to the pancreas is represented by the red
regions (the background is blue). The response maps (denoted
as $P^{RF}$) allow several observations to be made. The most
interesting is that the relative x and y positions as features
allow for clearer spatial separation of positive and negative
regions, via internal RF feature thresholding tests on them.
The trained RF classifier is able to recognize the negative class
patches residing in the background, such as liver, vertebrae
and muscle, using spatial location cues. In Fig. 7(d,h), implicit
vertical and horizontal decision boundary lines can be seen,
in comparison to Fig. 7(c,g). This demonstrates the superior
descriptive and discriminative power of the feature descriptor
on image patches (P and P’) over single pixel intensities.
Organs with similar CT values are significantly suppressed in
the patch-level response maps.

In summary, SIFT and its variations, e.g., D-SIFT, have been
shown to be informative, especially through spatial pooling
or packing. A wide range of pixel-level correlations and
visual information per image patch is also captured by the remaining
14 defined features. Both good classification specificity and
recall have been obtained in cross-validation using a Random
Forest implementation with 50 trees and the minimum leaf size
set to 150 (i.e., using the TreeBagger() function in Matlab).


C. Patch-level Labeling via Deep Convolutional Neural Network: $P^{CNN}$

In this work, we use a Convolutional Neural Network (CNN,
or ConvNet) with a standard architecture for binary image
patch classification. Five layers of convolutional filters first
compute, aggregate and assemble the low-level image features
into more complex ones, in a layer-by-layer fashion. Other CNN
layers perform max-pooling operations or consist of fully-connected
neural network layers. The CNN model we adopted
ends with a final two-way softmax classification layer for the ‘pancreas’
and ‘non-pancreas’ classes (refer to the corresponding figure). The fully
connected layers are constrained using “DropOut” in order
to avoid over-fitting in training, where each neuron or node
has a probability of 0.5 of being reset to a 0-valued activation.
DropOut is a method that behaves as a co-adaptation regularizer
when training the CNN. In testing, no DropOut operation
is needed. Modern GPU acceleration allows efficient training
and run-time testing of the deep CNN models. We use the
publicly available code base of cuda-convnet.^
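
As an illustration of the DropOut behaviour described above, the following sketch zeroes each activation with probability 0.5 during training and passes activations through unchanged at test time. It is not the cuda-convnet implementation: the function name is hypothetical, and the "inverted" rescaling of surviving activations is an assumption made here so that no extra operation is needed at test time.

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Zero each activation with probability drop_prob during training only."""
    if not training:
        return list(activations)  # no DropOut operation at test time
    keep_prob = 1.0 - drop_prob
    # Inverted-dropout scaling (an assumption) keeps the expected activation unchanged.
    return [a / keep_prob if rng.random() >= drop_prob else 0.0 for a in activations]


# Training-time pass over 1000 unit activations with a fixed seed for repeatability.
train_out = dropout([1.0] * 1000, drop_prob=0.5, training=True, rng=random.Random(0))
# Test-time pass leaves activations untouched.
test_out = dropout([1.0, 2.0], training=False)
```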

To extract dense image patch response maps, we use a
straightforward sliding window approach that extracts 2.5D
image patches composed of axial, coronal and sagittal planes
at any image position (see the corresponding figure). A deep CNN architecture can
encode large-scale image patches (even whole 224x224
pixel images) very efficiently, and no hand-crafted image
features are required any more. In this paper, the dimension
of image patches for training the CNN is 64x64 pixels, which
is significantly larger than the 25x25 in Sec. III-B. The larger
spatial scale or context is generally expected to achieve more
accurate patch labeling quality. For efficiency reasons, we
extract patches at a regular voxel interval for CNN feedforward evaluation
and then apply nearest neighbor interpolation^ to estimate the
values at skipped voxels. Three examples of dense CNN based
image patch labeling are demonstrated in the corresponding figure. We denote
the CNN model generated probability maps as $P^{CNN}$.
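
The strided evaluation followed by nearest neighbor interpolation can be sketched on a single 2D slice as follows. The function name, the stride of 3 and the toy probabilities are assumptions for this example. The paper works on voxels, and its actual sampling interval is not reproduced here.

```python
def nearest_neighbor_fill(sparse_probs, stride, height, width):
    """Expand probabilities sampled every `stride` pixels to a dense height x width map.

    sparse_probs[i][j] is the prediction at pixel (i * stride, j * stride).
    """
    n_rows, n_cols = len(sparse_probs), len(sparse_probs[0])
    dense = []
    for y in range(height):
        # Index of the closest sampled row, clamped to the sampled grid.
        i = min(n_rows - 1, (y + stride // 2) // stride)
        row = []
        for x in range(width):
            j = min(n_cols - 1, (x + stride // 2) // stride)
            row.append(sparse_probs[i][j])
        dense.append(row)
    return dense


# Predictions available at rows/columns 0 and 3 of a 5x5 slice.
sparse = [[0.1, 0.9],
          [0.5, 0.3]]
dense = nearest_neighbor_fill(sparse, stride=3, height=5, width=5)
```

Each skipped pixel simply copies the prediction of its closest evaluated position, so the number of CNN forward passes drops by roughly the square of the stride on a 2D slice.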

The computational expense of deep CNN patch labeling
per patch (in a sliding window manner) is still higher than that in
Sec. III-B. In practice, dense patch labeling by $P^{RF}$ runs
exhaustively at 3-pixel intervals, but $P^{CNN}$ is only evaluated
at pixel locations that pass the first stage of a cascaded random
forest superpixel classification framework. This process is
detailed in Sec. III-D, where $C^1_{SP}$ is operated in a high recall
(close to 100%) and low specificity mode to minimize the false
negative rate (FNR) as the initial layer of the cascade. The other
important reason for doing so is to largely alleviate the training 
unbalance issue for pc 'at at C^p. After this initial pruning, 
the number ratio of non-pancreas versus pancreas superpixels 
changes from > 100 to ^ 5. The similar treatment is employed 
in our recent work (201 where all “Regional CNN” (R-CNN) 
based algorithmic variations ioi for pancreas segmentation is 
performed after a superpixel cascading. 


D. Superpixel-level Feature Extraction, Cascaded Classification and Pancreas Segmentation


In this section, we train three different superpixel-level
random forest classifiers: $C^1_{SP}$, $C^2_{SP}$ and $C^3_{SP}$.
These three classifier components further form two cascaded
RF classification frameworks (F-1, F-2), as shown in Fig. 10.
The superpixel labels are inferred from the overlapping
ratio r (defined in Sec. III-A) between the superpixel label
map and the ground-truth pancreas mask. If r > 0.5, the
superpixel is positive, while if r < 0.2, the superpixel is
assigned as negative. The remaining superpixels that fall within
0.2 ≤ r ≤ 0.5 (a relatively small portion of all
superpixels) are considered ambiguous, are not assigned
a label and as such are not used in training.
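The labeling rule above can be sketched as follows. The definition of r is given in Sec. III-A, which is not reproduced here; this sketch assumes r is the fraction of a superpixel's pixels that fall inside the ground-truth mask. The function name `label_superpixels` and the list-of-rows inputs are hypothetical.

```python
def label_superpixels(superpixel_map, gt_mask, pos_thr=0.5, neg_thr=0.2):
    """Return {superpixel_id: 1 | 0 | None}; None marks ambiguous superpixels."""
    inside, total = {}, {}
    for sp_row, gt_row in zip(superpixel_map, gt_mask):
        for sp, gt in zip(sp_row, gt_row):
            total[sp] = total.get(sp, 0) + 1
            inside[sp] = inside.get(sp, 0) + (1 if gt else 0)
    labels = {}
    for sp, n in total.items():
        r = inside[sp] / n  # assumed definition of the overlap ratio r
        # Positive above 0.5, negative below 0.2, otherwise left unlabeled.
        labels[sp] = 1 if r > pos_thr else 0 if r < neg_thr else None
    return labels
```

Superpixels mapped to `None` would simply be skipped when assembling the training set.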

Training utilizes both the original CT image slices
($I^{CT}$ in Fig. 10) and the probability response maps ($P^{RF}$) via


^ https://code.google.com/p/cuda-convnet2

^In our empirical testing, simple nearest neighbor interpolation seems 
sufficient due to the high quality of deep CNN probability predictions. 
Fig. 7. Two sample slices from different patients are shown in (a) and (e). The corresponding superpixel segmentations (b, f), KDE probability response
maps (c, g) and RF patch-level probability response maps (d, h) are shown. In (c, g) and (d, h), red represents the highest probabilities. In (d, h), the purple color
represents areas where the probabilities are so small that they can be deemed insignificant areas of interest.

Fig. 8. The proposed CNN model architecture is composed of five convolutional layers with max-pooling and two fully-connected layers
with DropOut [?] connections. A final 2-way softmax layer gives a probability p(x) of 'pancreas' and 'non-pancreas' per data sample (or
image patch). The number and model parameters of convolutional filters and neural network connections for each layer are as shown.








Fig. 9. Axial CT slice of a manual (gold standard) segmentation of the pancreas. From Left to Right, there are the ground-truth segmentation 
contours (in red); RF based coarse segmentation {aSrp}; a 2.5D input image patch to CNN and the deep patch labeling result using CNN. 


the hand-crafted feature based patch-level classification (i.e.,
Sec. III-B). The 2D superpixel supporting maps (i.e., Sec. III-A)
are used for feature pooling and extraction on a superpixel
level. The CT pixel intensity/attenuation numbers and the
per-pixel pancreas class probability response values (from
dense patch labeling of $P^{RF}$, or $P^{CNN}$ later) within each
superpixel are treated as two empirical unordered distributions.
Thus our superpixel classification problem is converted into
modeling the difference between the empirical distributions of
the positive and negative classes. We compute 1) simple statistical
features of the 1st-4th order statistics such as mean, std,
skewness and kurtosis [?], and 2) histogram-type features of eight
percentiles (20%, 30%, ..., 90%), per distribution in the intensity
or $P^{RF}$ channel, respectively. Once concatenated, the resulting
24 features for each superpixel instance are fed to train random
forest classifiers.
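The 24-dimensional superpixel descriptor (4 moments plus 8 percentiles for each of the two channels) can be sketched as below. This is an illustration only: the function names are hypothetical, population (biased) moment estimators are assumed, and the linearly interpolated percentile is one of several common conventions that the text does not pin down.

```python
import math


def percentile(sorted_vals, q):
    """Linearly interpolated q-th percentile (0-100) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def channel_features(values):
    """12 features: mean, std, skewness, kurtosis and 8 percentiles (20..90)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2, m3, m4 = (sum((v - mean) ** k for v in values) / n for k in (2, 3, 4))
    std = math.sqrt(m2)
    skew = m3 / std ** 3 if std > 0 else 0.0
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0
    s = sorted(values)
    return [mean, std, skew, kurt] + [percentile(s, q) for q in range(20, 100, 10)]


def superpixel_features(intensities, probabilities):
    """Concatenate both channels into one 24-dimensional feature vector."""
    return channel_features(intensities) + channel_features(probabilities)
```

Because both channels are treated as unordered distributions, the descriptor length is the same regardless of how many pixels a superpixel contains.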


Due to the highly unbalanced quantities between foreground
(pancreas) superpixels and background (the rest of the CT volume)
superpixels, a two-tiered cascade of random forests
is exploited to address this type of rare event detection
problem [?]. In a cascaded classification, $C^1_{SP}$, once trained, is
applied exhaustively by scanning all superpixels in an input CT
volume. Based on the receiver operating characteristic (ROC)
curves in Fig. 11 (Left) for $C^1_{SP}$, we can safely reject or prune
97% of the negative superpixels while maintaining nearly 100%
recall or sensitivity. The remaining 3% of negatives, often referred
to as "hard negatives" [?], along with all positives are employed
to train the second classifier $C^2_{SP}$ in the same feature space. Combining
$C^1_{SP}$ and $C^2_{SP}$ is referred to as Framework 1 (F-1) in the
subsequent sections.
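The two-tier cascade logic can be sketched as follows. This is a schematic under stated assumptions: `stage1` and `stage2` are hypothetical callables returning a pancreas score for a feature vector (standing in for the trained random forests), and the helper that picks a high-recall operating point from positive training scores is one simple way to do it, not necessarily the procedure used here.

```python
import math


def threshold_for_recall(pos_scores, target_recall=0.99):
    """A threshold that keeps at least `target_recall` of the positive samples."""
    s = sorted(pos_scores)
    n_keep = math.ceil(target_recall * len(s))  # positives that must survive
    n_miss = len(s) - n_keep                    # positives we may lose
    return s[n_miss]


def cascade_predict(features, stage1, thr1, stage2, thr2):
    """Stage 1 prunes easy negatives; stage 2 decides on the survivors."""
    if stage1(features) < thr1:
        return 0  # rejected early, stage 2 is never evaluated
    return 1 if stage2(features) >= thr2 else 0
```

Only samples that pass the first stage (the positives and the "hard negatives") would be used to train the second-stage classifier.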

Similarly, we can train a random forest classifier $C^3_{SP}$
by replacing $C^2_{SP}$'s feature extraction dependency on the
$P^{RF}$ probability response maps with the deep CNN patch
classification maps of $P^{CNN}$. The same 24 statistical moments















Fig. 10. The flow chart of input channels and component classifiers to form
the overall frameworks 1 (F-1) and 2 (F-2). $I^{CT}$ indicates the original CT
image channel; $P^{RF}$ represents the probability response map by RF based
patch labeling in Sec. III-B and $P^{CNN}$ from deep CNN patch classification
in Sec. III-C, respectively. The superpixel-level random forest classifier $C^1_{SP}$
is trained with all positive and negative superpixels in the $I^{CT}$ and $P^{RF}$
channels; $C^2_{SP}$ and $C^3_{SP}$ are learned using only "hard negatives" and all
positives, in the $I^{CT} \cup P^{RF}$ or $I^{CT} \cup P^{CNN}$ channels, respectively.
Forming $C^1_{SP} \rightarrow C^2_{SP}$ or $C^1_{SP} \rightarrow C^3_{SP}$ into two overall cascaded models
results in frameworks F-1 and F-2. Note that F-1 and F-2 share the first
layer of classification cascades to coarsely prune about 96% of the initial superpixels
using both intensity and $P^{RF}$ features (refer to Fig. 11 Left). Only intensity
based statistical features for $C^1_{SP}$ produce significantly inferior results. On
the other hand, $C^3_{SP}$ can be learned using all three available information
channels of $I^{CT} \cup P^{RF} \cup P^{CNN}$, which results in 36 superpixel-level
features. Based on our initial empirical evaluation, $I^{CT} \cup P^{CNN}$ is as
sufficient as using all three channels. This means that $I^{CT} \cup P^{CNN}$ is
the optimal feature channel combination or configuration considering both the
classification effectiveness and model complexity. The coupling of $P^{CNN}$
into $C^3_{SP}$ consistently shows better segmentation results than $P^{RF}$ for $C^2_{SP}$,
whereas $P^{CNN}$ is not powerful enough to be used alone.


and percentile features per superpixel, from the two information
channels $I^{CT}$ and $P^{CNN}$, are extracted to train $C^3_{SP}$. Note
that the CNN model that produces $P^{CNN}$ is trained with the
image patches sampled from only "hard negative" and positive
superpixels (aligned with the second-tier RF classifiers $C^2_{SP}$
and $C^3_{SP}$). For simplicity, $P^{RF}$ is only trained once with all
positive and negative image patches. This will be referred to
as Framework 2 (F-2) in the subsequent sections. F-1 only uses
$P^{RF}$, whereas F-2 depends on both $P^{RF}$ and $P^{CNN}$ (with a
little extra computational cost).

The flow chart of frameworks 1 (F-1) and 2 (F-2) is
illustrated in Fig. 10. The two-level cascaded random forest
classification hierarchy is found empirically to be sufficient
(although a deeper cascade is possible) and is implemented to
obtain F-1: $C^1_{SP}$ and $C^2_{SP}$, or F-2: $C^1_{SP}$ and $C^3_{SP}$. The
binary 3D pancreas volumetric mask is obtained by stacking
the binary superpixel labeling outcomes (after $C^2_{SP}$ in F-1
or $C^3_{SP}$ in F-2) for each 2D axial slice, followed by 3D
connected component analysis. By
assuming the overall connectivity of the pancreas's 3D shape,
the largest 3D connected component is kept as the final
segmentation. The binarization thresholds of the random forest
classifiers in $C^2_{SP}$ and $C^3_{SP}$ are calibrated using data in the
training folds in 6-fold cross-validation, via a simple grid
search. $C^3_{SP}$ can be learned using all three available
information channels of $I^{CT} \cup P^{RF} \cup P^{CNN}$, which will produce
36 superpixel-level features. Based on our initial empirical
evaluation, $I^{CT} \cup P^{CNN}$ is as sufficient as using all three
channels. As demonstrated in Sec. IV-B, $C^3_{SP}$ outperforms
$C^2_{SP}$ mainly due to the use of $P^{CNN}$ instead of $P^{RF}$.
However, $P^{CNN}$ is not powerful enough to be used alone in
$C^3_{SP}$. In [20], standalone Patch-ConvNet dense probability
maps (without any post-processing) are processed for pancreas
segmentation after using F-1 as an initial cascade. The
corresponding pancreas segmentation performance is not as
accurate as F-1 or F-2. This finding is analogous to [43],
[?], where hand-crafted features need to be combined with
deep image features to improve pulmonary nodule detection
in chest CT scans or chest pathology detection using X-rays.
Refer to Sec. IV-B for a detailed comparison and numerical
results. Recent computer vision work also demonstrates the
performance improvement when combining hand-crafted and
deep image features for image segmentation [45] and video
action recognition [?] tasks.
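The post-processing step that keeps only the largest 3D connected component of the stacked binary slices can be sketched as follows. This is a plain breadth-first search over a nested-list volume; the function name is hypothetical and the 6-neighbor connectivity is an assumption, since the connectivity used is not stated.

```python
from collections import deque


def largest_component_3d(mask):
    """Keep the largest 6-connected foreground component of mask[z][y][x]."""
    Z, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, best = set(), []
    for z in range(Z):
        for y in range(Y):
            for x in range(X):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                # Flood-fill one component starting from this seed voxel.
                comp, queue = [], deque([(z, y, x)])
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    comp.append((cz, cy, cx))
                    for dz, dy, dx in steps:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < Z and 0 <= ny < Y and 0 <= nx < X
                                and mask[nz][ny][nx]
                                and (nz, ny, nx) not in seen):
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = [[[0] * X for _ in range(Y)] for _ in range(Z)]
    for z, y, x in best:
        out[z][y][x] = 1
    return out
```

Isolated false-positive superpixels that survive the cascade are removed by this step, at the cost of discarding a true fragment if the predicted pancreas is split in two.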


IV. DATA AND EXPERIMENTAL RESULTS 
A. Imaging Data 

80 3D abdominal portal-venous contrast-enhanced CT scans
(~70 seconds after intravenous contrast injection) acquired
from 53 male and 27 female subjects are used in our study
for evaluation. 17 of the subjects are from a kidney donor
transplant list of healthy patients who have abdominal CT
scans prior to nephrectomy. The remaining 63 patients are
randomly selected by a radiologist from the Picture Archiving
and Communications System (PACS) from the population that
has neither major abdominal pathologies nor pancreatic cancer
lesions. The CT datasets are obtained from the National Institutes
of Health Clinical Center. Subjects range in age from 18
to 76 years with a mean age of 46.8 ± 16.7. The scan resolution
is 512x512 pixels (varying pixel sizes) with slice thickness
ranging from 1.5 to 2.5 mm on Philips and Siemens MDCT
scanners. The tube voltage is 120 kVp. Manual ground-truth 
segmentation masks of the pancreas for all 80 cases are 
provided by a medical student and verified/modified by a 
radiologist. 


B. Experiments 

Quantitative experimental results are assessed using six-fold
cross-validation, as described in Sec. III-B and Sec.
III-D. Several metrics to evaluate the accuracy and robustness
of the methods are computed. The Dice similarity index (SI)
interprets the overlap between two sample sets,
$SI = 2|A \cap B| / (|A| + |B|)$, where A and B refer to the algorithm
output and the manual ground-truth 3D pancreas segmentation,
respectively. The Jaccard index (JI) is another statistic used to
compute the similarity between the segmentation result and
the reference standard, $JI = |A \cap B| / |A \cup B|$, called
"intersection over union" in the PASCAL VOC challenges
[?], [?]. The volumetric recall (i.e., sensitivity) and precision
values are also reported.
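The four volumetric metrics can be computed directly from voxel sets, as in the sketch below. The function name and the representation of a segmentation as a Python set of voxel coordinates are illustrative choices, and the conventions for empty inputs are assumptions.

```python
def overlap_metrics(pred, truth):
    """pred, truth: sets of voxel coordinates labeled as pancreas."""
    inter = len(pred & truth)
    union = len(pred | truth)
    # Two empty masks are treated as perfect agreement (a convention).
    dice = 2.0 * inter / (len(pred) + len(truth)) if union else 1.0
    jaccard = inter / union if union else 1.0
    precision = inter / len(pred) if pred else 0.0
    recall = inter / len(truth) if truth else 0.0
    return {"dice": dice, "jaccard": jaccard,
            "precision": precision, "recall": recall}
```

For a single case the two overlap scores are related by JI = SI / (2 - SI), so a per-case Dice of 0 (no overlap) always comes with a Jaccard index of 0.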
Fig. 11 illustrates the ROC curves for 6-fold cross-validation
of the two-tiered superpixel-level classifiers $C^1_{SP}$, $C^2_{SP}$ and
$C^3_{SP}$ used to assemble our frameworks F-1 and F-2, respectively.
Red plots use each superpixel as a count to compute sensitivity
and specificity. In blue plots, superpixels are weighted
by their sizes (e.g., numbers of pixels and pixel sizes) for the
sensitivity and specificity calculation. The Area Under Curve
(AUC) values of $C^2_{SP}$ are noticeably lower than the $C^1_{SP}$ AUC
values (0.884 versus 0.997), indicating that $C^2_{SP}$ is much harder
to train, since its negative samples are the "hard negatives"
that are classified positively by $C^1_{SP}$. Random forest
classifiers with 50 ~ 200 trees are evaluated, with similar
empirical performances. In $C^3_{SP}$, the dense patch-level image
labeling (in the second level of the cascade) is conducted by a
deep convolutional neural network (i.e., Patch-ConvNet) to
generate $P^{CNN}$. Three examples of dense CNN based image
patch labeling are demonstrated in Fig. 12. The AUC value of
$C^3_{SP}$, obtained by swapping the probability response maps from $P^{RF}$ to
$P^{CNN}$, improves to 0.931, compared to 0.884 using $C^2_{SP}$
in the pixel-weighted volume metric. This demonstrates the
performance benefit of using CNN for dense patch labeling
(Sec. III-C) versus hand-crafted image features (Sec. III-B).
See Fig. 11 (Right) and (Middle), respectively. Standalone
Patch-ConvNet dense probability maps can be smoothed
and thresholded for pancreas segmentation as reported in
[20], where Dice coefficients of 60.9 ± 10.4% are achieved.
When only the 12 features extracted from the $P^{CNN}$ maps are used
for $C^3_{SP}$, the final pancreas segmentation accuracy drops to
64.5 ± 12.3% in Dice scores, compared to F-1 (68.8 ± 25.6%)
and F-2 (70.7 ± 13.0%) in Table I. Similarly, recent work
[?], [?] observes that deep CNN image features should be
combined with hand-crafted features for better performance
in computer-aided detection tasks.

Next, the pancreas segmentation performance evaluation is
conducted with respect to the total number of patient scans used
for the training and testing phases. Using our framework F-1
on 40, 60 and 80 (i.e., 50%, 75% and 100% of the total 80
datasets) patient scans, the Dice, JI, precision and recall are
computed under six-fold cross-validation. Table I shows the
computed results using image patch-level features and multi-level
classification (i.e., performing $C^1_{SP}$ and $C^2_{SP}$ on $I^{CT}$
and $P^{RF}$) and how performance changes with the addition of
more patient data. Steady improvements of ~4% in the Dice
coefficient and ~5% in the Jaccard index are observed, from
40 to 60, and 60 to 80. Fig. 13 illustrates some sample final
pancreas segmentation results from the 80 patient execution
(i.e., Test 3 in Table I) for two different patients. The results
are divided into three categories: good, fair and poor. The
good category refers to a computed Dice coefficient above
90% (15 patients), fair to 50% ≤ Dice ≤ 90% (49
patients) and poor to Dice < 50% (16 patients).

Then, we evaluate the difference of the proposed F-1 versus
F-2 on 80 patients, using the same four metrics (i.e., Dice, JI,
precision and recall). Table I shows the comparison results.
The same six-fold cross-validation criterion is employed so
that direct comparisons can be made. From the table, it can
be seen that an increase of about 2% in the Dice coefficient is
obtained by using F-2, but the main improvement can be
noticed in the minimum values (i.e., the lower performance
bound) for each of the metrics. Usage of deep patch labeling
prevents the case of no pancreas segmentation while keeping
slightly higher mean precision and recall values. The standard
deviations also drop by nearly 50% from F-1 to F-2
(from 25.6% to 13.0% in Dice; and 25.4% to 13.6% in JI).
Note that F-1 has standard deviation ranges similar to those of the
previous methods [10], [?], [?], [?], [?], and F-2 significantly
improves upon all of them. From Fig. [?] and Fig. [?], it can
be inferred that using the relative x-axis and y-axis positions
as features aided in reducing the overall false negative rates.
Based on Table I, we observe that F-2 provides consistent
performance improvements over F-1, which implies that CNN
based dense patch labeling (Sec. III-C) shows more promising
results than the conventional hand-crafted image features and
random forest patch classification alone (Sec. III-B). Fig. 14
depicts an example patient where the F-2 Dice score is improved
by 18.6% over F-1 (from 63.9% to 82.5%). In this particular
case, the close proximity of the stomach and duodenum to
the pancreas head proves challenging for F-1 to distinguish
without the CNN counterpart. The surface-to-surface
overlays illustrate how both frameworks compare to
the ground truth manual segmentation.

F-1 performs comparably to the state-of-the-art pancreas
segmentation methods while F-2 slightly but consistently
outperforms the others, even under six-fold cross-validation (CV)
instead of the "leave-one-patient-out" (LOO) protocol used in [10],
[?], [?], [?], [?], [?]. Note that our results are not directly
or strictly comparable with [10], [7], [?], [?], [?], [6] since
different datasets are used for evaluation. Under the same six-fold
cross-validation, our bottom-up segmentation method can
significantly outperform an implemented version of "multi-atlas
and label fusion" (MALF) based on [11], [12], on the
pancreas segmentation dataset studied in this paper. Details are
provided later in this section. Table II reflects the comparison
of Dice, JI, precision and recall results between our methods
F-1 and F-2 and other approaches, in multi-atlas registration
and label fusion based multi-organ segmentation [10], [7],
[?], [?], [?] and multi-phase single organ (i.e., pancreas)
segmentation [?]. Previous numerical results are taken from
the publications [10], [?], [?], [?], [?], [?]. We choose the best
result out of the different parameter configurations in [?]. Based
on 80 CT datasets, our results are comparable to and slightly
better than the recent state-of-the-art work [10], [?], [?], [?],
[?]. For example, Dice coefficients of 68.8% ± 25.6% using
F-1 and 70.7% ± 13.0% using F-2 are obtained (6-fold CV),
versus 69.6% ± 16.7% in [7], 65.5% in [?], 65.5% ± 18.6%
in [10] and 69.1% ± 15.3% in [8] (LOO).

We exploit two variations of pancreas segmentation from the
perspective of bottom-up information propagation from image
patches to (segments) superpixels. Both frameworks are
carried out in a six-fold cross-validation (CV) manner. Our
protocol is arguably harder than the "leave-one-out" (LOO)
criterion in [?], [?], [?], [?], [?], since fewer patient datasets are
used in training and more separate patient scans for testing. In
fact, [7] does demonstrate a notable performance drop from
using 149 patients in training versus 49 patients under LOO,
i.e., the mean Dice coefficients decreased from 69.6% ± 16.7%
Fig. 11. ROC curves (x-axis: false positive rate) to analyze the superpixel classification results, in a two-layer cascade of RF classifiers: (Left) the first layer classifier, $C^1_{SP}$; (Middle)
the second layer classifier, $C^2_{SP}$; (Right) the alternative second layer classifier, $C^3_{SP}$. Red plots use each superpixel as a count to calculate sensitivity
and specificity. In blue plots, superpixels are weighted by their sizes (e.g., numbers of pixels and pixel sizes) for the sensitivity and specificity calculation.





Fig. 12. Three examples of deep CNN based image patch labeling probability response maps, one per row. Red shows a stronger pancreas class response and
blue a weaker response. From left to right: the original CT image, the CT image with the annotated pancreas contour in red, and the CNN response
map overlaid on the CT image.


to 58.2% ± 20.0%. This indicates that the multi-atlas fusion
approaches [?], [?], [?], [?], [?], [?] may actually achieve
lower segmentation accuracies than reported if evaluated under the
six-fold cross-validation protocol. At 40 patients, our result using
framework 1 is 2.2% better than the results reported by [7]
using 50 patients (Dice coefficients of 60.4% versus 58.2%).
Compared to holding N − 1 patient datasets directly in
memory for multi-atlas registration methods, our learned
models are more compactly encoded into a series of patch-
and superpixel-level random forest classifiers and the CNN
classifier for patch labeling. The computational efficiency has
also been drastically improved, to the order of 6 ~ 8 minutes
per testing case (using a mix of Matlab and C implementation,
with ~50% of the time spent on superpixel generation), compared
to others requiring 10 hours or more. The segmentation framework (F-2)
using deep patch labeling confidences is also more numerically
stable, with no complete failure case and noticeably lower
standard deviations.

Comparison to R-CNN and its variations [?], [20]:

The conventional approach for classifying superpixels or image
segments in computer vision is "bag-of-words" [48],
[49]. "Bag-of-words" methods compute dense SIFT, HOG
and LBP image descriptors, embed these descriptors through
various feature encoding schemes and pool the features inside
each superpixel for classification. Both the model complexity and
computational expense [48], [49] are very high, compared
Fig. 13. Pancreas segmentation results with the computed Dice coefficients for one good (Top Row) and two fair (Middle, Bottom Rows) segmentation
examples. Sample original CT slices are shown in the Left Column and the corresponding ground truth manual segmentations, in yellow, in the Middle
Column. Final computed segmentation regions are shown in red in the Right Column, with the Dice coefficient for the volume above each slice.
The zoomed-in areas of the slice segmentation in the orange boxes are shown to the right of the image.


Method  N    SI (%)                     JI (%)                     Precision (%)              Recall (%)
F-1     40   60.4 ± 22.3 [2.0, 96.4]    46.7 ± 22.8 [0, 93.0]      55.6 ± 29.8 [1.2, 100]     80.8 ± 21.2 [4.8, 99.8]
F-1     60   64.9 ± 22.6 [0, 94.2]      51.7 ± 22.6 [0, 89.1]      70.3 ± 29.0 [0, 100]       69.1 ± 25.7 [0, 98.9]
F-1     80   68.8 ± 25.6 [0, 96.6]      57.2 ± 25.4 [0, 93.5]      71.5 ± 30.0 [0, 100]       72.5 ± 27.2 [0, 100]
F-2     80   70.7 ± 13.0 [24.4, 85.3]   57.9 ± 13.6 [13.9, 74.4]   71.6 ± 10.5 [34.8, 85.8]   74.4 ± 15.1 [15.0, 90.9]

TABLE I

Examination of varying number of patient datasets using framework 1, in four metrics of Dice, JI, precision and recall. Mean,
standard deviation, lower and upper performance ranges are reported. Comparison of the presented framework 1 (F-1) versus
framework 2 (F-2) in 80 patients is also presented.


with ours (Sec. III-D). Recently, a "Regional CNN" (R-CNN)
[?], [?] method was proposed and shows substantial
performance gains in the PASCAL VOC object detection and
semantic segmentation benchmarks [?], compared to previous
"bag-of-words" models. A simple R-CNN implementation for
pancreas segmentation has been explored in our previous work
[?], which reports an evidently worse result (Dice coefficient
62.9% ± 16.1%) than our F-2 framework (Dice 70.7 ± 13.0%)
that spatially pools the CNN patch classification confidences
per superpixel. Note that R-CNN [?], [?] is not an "end-to-end"
trainable deep learning system: R-CNN first uses the
pre-trained or fine-tuned CNNs as image feature extractors
for superpixels and then the computed deep image features
are classified by support vector machine models.

Our recent work [20] is an extended version of pancreas
segmentation from the region-based convolutional neural
networks (R-CNN) for semantic image segmentation [?], [?].
In [20], 1) we exploit multi-level deep convolutional networks
which sample a set of bounding boxes covering each image
superpixel at multiple spatial scales in a zoom-out fashion
[?]; 2) the best performing model in [20] is a stacked R2-ConvNet
which operates in the joint space of CT intensities
and the Patch-ConvNet dense probability maps, similar to
F-2. With the above two method extensions, [20] reports a
Dice coefficient of 71.8 ± 10.7% in four-fold cross-validation
(which is slightly better than the 70.7 ± 13.0% of F-2 using the
same dataset). However, [20] cannot be directly trained and
tested on the raw CT scans as in this paper, due to the
high data imbalance between pancreas and non-pancreas
superpixels. There are overwhelmingly more negative instances
than positive ones if training the CNN models directly on all
image superpixels from abdominal CT scans. Therefore, given
an input abdomen CT, an initial set of superpixel regions is first
generated or filtered by a coarse cascading process of operating
the random forest based pancreas segmentation [?] (similar
to F-1), at low or conservative classification thresholds. Over
96% of the original volumetric abdominal CT scan space is
rejected for the next step (see Fig. 11 Left). For pancreas
segmentation, these pre-labeled superpixels serve as regional
candidates with high sensitivity (>97%) but low precision
Fig. 14. Examples of pancreas segmentation results using F-1 and F-2 with the computed Dice coefficients for one patient. Original CT slices for the patient
are shown in Column (a) and the corresponding ground truth manual segmentations, in yellow, in Column (b). Final computed segmentations using F-2
and F-1 are shown in red in Columns (c, d), with the Dice coefficients for the volume above the first slice. The zoomed-in areas of the slice segmentation in the
orange boxes are shown to the right of the images. Their surface-to-surface distance maps overlaid on the ground truth mask are shown in Columns (c, d)
Bottom, and the corresponding ground truth segmentation mask, in red, in Column (b) Bottom. Red illustrates larger distances and green
smaller distances.


(generally called the Candidate Generation or CG process). The
resulting initial DSC is 27% on average. Then [20] evaluates
several variations of CNNs for segmentation refinement (or
pruning). F-2 performs comparably to the extended R-CNN
version [20] for pancreas segmentation and is able to run
without using F-1 to generate pre-selected superpixel
candidates (which nevertheless is required by [?], [?]). As
discussed above, we would argue that these hybrid approaches
combining or integrating deep and non-deep learning
components (like this work and [20], [47], [?], [?], [?]) will
co-exist with the other fully "end-to-end" trainable CNN systems
[?], [?] that may produce comparable or even inferior
segmentation accuracy levels. For example, [?] is a two-staged
method of deep CNN image labeling followed by fully
connected Conditional Random Field (CRF) post-optimization
[54], achieving a 71.6% intersection-over-union value versus
62.2% in [53], on the PASCAL VOC 2012 test set for the semantic
segmentation task [?].

Comparison to MALF (under six-fold CV): For ease
of comparison to the previously well studied "multi-atlas and
label fusion" (MALF) approaches, we implement a MALF
solution for pancreas segmentation using the publicly available
C++ code bases [11], [12]. The performance evaluation
criterion is the same six-fold patient split for cross-validation,
not the "leave-one-patient-out" (LOO) used in [10], [7], [8], [9], [?],
[?]. Specifically, each atlas in the training folds is registered
to every target CT image in the testing fold, by the fast
free-form deformation algorithm developed in NiftyReg [11]. Cubic
B-splines are used to deform a source image to optimize an
objective function based on the normalized mutual information
and a bending energy term. The grid spacing along the three axes
is set to 5 mm. The weight of the bending energy term is
0.005 and normalized mutual information with 64 bins is
used. The optimization is performed in three coarse-to-fine
levels and the maximal number of iterations per level is 300.
More details can be found in [11]. The registrations are used
to warp the pancreas in the atlas set (66 or 67 atlases) to
the target image. Nearest-neighbor interpolation is employed
since the labels are binary images. For each voxel in the target
image, each atlas provides an opinion about the label. The
probability of pancreas at any voxel x in the target is
determined by $\hat{L}(x) = \sum_{i=1}^{n} \omega_i(x) L_i(x)$, where $L_i(x)$ is the
warped i-th pancreas atlas and $\omega_i(x)$ is a weight assigned to
the i-th atlas at location x, with $\sum_{i=1}^{n} \omega_i(x) = 1$ and n the
number of atlases. In our 6-fold cross-validation experiments,
n = 66 or 67. We adopt the joint label fusion algorithm
[12], which estimates the voting weights $\omega_i(x)$ by simultaneously
considering the pairwise atlas correlations and local image
appearance similarities at x. More details about how to capture
the probability that different atlases produce the same label
error at location x via a formulation of a dependency matrix
can be found in [12]. The final binary pancreas segmentation
label or map $L(x)$ in the target can be computed by thresholding
$\hat{L}(x)$. The resulting MALF segmentation accuracy in Dice
coefficients is 52.51 ± 20.84%, in the range of [0%, 80.56%].
This pancreas segmentation accuracy is noticeably lower than
the mean Dice scores of 58.2% ~ 69.6% reported in [10], [7],
[8], [9], [?], [?] under the protocol of "leave-one-patient-out"
(LOO) for MALF methods. This observation may indicate the
performance deterioration of MALF from LOO (equivalent to
80-fold CV) to 6-fold CV, which is consistent with the finding
that the segmentation accuracy drops from 69.6% to 58.2%
when only 49 atlases are available instead of 149 [7].
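The voting step of label fusion, a per-voxel weighted sum of the warped atlas labels followed by thresholding, can be sketched as below. This is not the joint label fusion algorithm: the default uniform weights (plain majority voting) are a placeholder for the estimated voting weights, and the function name, the flat-list representation and the 0.5 threshold are assumptions made for illustration.

```python
def fuse_labels(warped_atlases, weights=None, threshold=0.5):
    """warped_atlases: list of n flat binary label lists, already registered."""
    n = len(warped_atlases)
    n_vox = len(warped_atlases[0])
    if weights is None:
        # Placeholder: uniform weights, i.e. majority voting, NOT joint
        # label fusion. Weights must sum to 1 over atlases at each voxel.
        weights = [[1.0 / n] * n_vox for _ in range(n)]
    # Per-voxel weighted vote, interpreted as the pancreas probability.
    prob = [
        sum(weights[i][x] * warped_atlases[i][x] for i in range(n))
        for x in range(n_vox)
    ]
    # Threshold the probability to obtain the binary label map.
    return prob, [1 if p > threshold else 0 for p in prob]
```

Spatially varying weights can be supplied through `weights`, with one list per atlas and one entry per voxel.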

Furthermore, it takes about 33.5 days to fully conduct the
six-fold MALF cross-validation experiments using a Windows
server, whereas the proposed bottom-up superpixel cascade
Reference     N     SI (%)         JI (%)         Precision (%)   Recall (%)
[?]           20    -              57.9           -               -
[?]           28    -              46.6           -               -
[7]           150   69.6 ± 16.7    55.5 ± 17.1    67.9 ± 18.2     74.1 ± 17.1
[7]           50    58.2 ± 20.0    43.5 ± 17.8    -               -
[?]           100   65.5           49.6           70.7            62.9
[10]          100   65.5 ± 18.6    -              -               -
[8]           100   69.1 ± 15.3    54.6           -               -
Framework 1   80    68.8 ± 25.6    57.2 ± 25.4    71.5 ± 30.0     72.5 ± 27.2
Framework 2   80    70.7 ± 13.0    57.9 ± 13.6    71.6 ± 10.5     74.4 ± 15.1
MALF          80    52.5 ± 20.8    38.1 ± 18.3    -               -

TABLE II

Comparison of F-1 and F-2 in six-fold cross-validation to the recent state-of-the-art methods [10], [7], [8], [9], [?], [?] in LOO and
our implementation of "multi-atlas and label fusion" (MALF) using publicly available C++ code bases [11], [12] under the same
six-fold cross-validation. The proposed bottom-up pancreas segmentation methods F-1 and F-2 significantly outperform their
MALF counterpart: 68.8 ± 25.6% (F-1), 70.7 ± 13.0% (F-2) versus 52.51 ± 20.84% in Dice coefficients (mean ± std).


approach finishes in ~9 hours for 80 cases (6.7 minutes per
patient scan on average). In summary, using the same dataset
and under six-fold cross-validation, our bottom-up segmentation
method significantly outperforms its MALF counterpart:
70.7 ± 13.0% versus 52.51 ± 20.84% in Dice coefficients,
while being approximately 90 times faster. Converting our
Matlab/C++ implementation into pure C++ is expected to give a
further 2 ~ 3 times speed-up.

V. CONCLUSION AND DISCUSSION 

In this paper, we present a fully-automated bottom-up
approach for pancreas segmentation in abdominal computed
tomography (CT) scans. The proposed method is based on a
hierarchical cascade of information propagation by classifying
image patches at different resolutions and multi-channel
feature information pooling at (segments) superpixels. Our
algorithm flow is a sequential process of decomposing CT slice
images as a set of disjoint boundary-preserving superpixels;
computing pancreas class probability maps via dense patch
labeling; classifying superpixels via aggregating both intensity
and probability information to form image features that are
fed into the cascaded random forests; and enforcing a simple
spatial connectivity based post-processing. The dense image
patch labeling can be realized by an efficient random forest
classifier on hand-crafted image histogram, location and texture
features; or deep convolutional neural network classification
on larger image windows (i.e., with more spatial context).

The main component of our method is to classify superpixels
into either the pancreas or non-pancreas class. Cascaded
random forest classifiers are formulated for this task and
performed on the pooled superpixel statistical features from
intensity values and supervisedly learned class probabilities
($P^{RF}$ and/or $P^{CNN}$). The learned class probability maps
(e.g., $P^{RF}$ and $P^{CNN}$) are treated as the supervised semantic
class image embeddings, which can be implemented, via an
open framework, by various methods to learn the per-pixel
class probability response.

To overcome the low image boundary contrast issue in
superpixel generation, which is common in medical
imaging, we suggest that efficient supervised edge learning
techniques may be utilized to artificially "enhance" the
strength of semantic object-level boundary curves in 2D or
surfaces in 3D. For example, one of the future directions is
to couple or integrate the structured random forests based
edge detection [55] into a new image segmentation framework
(MCG: Multiscale Combinatorial Grouping) [?], which
permits a user-customized image gradient map. This new
approach may be capable of generating image superpixels that
preserve even very weak semantic object boundaries
well (in the image gradient sense) and subsequently prevent
segmentation leakage.

Finally, voxel-level pancreas segmentation masks can be
propagated from the stacked superpixel-level classifications
and further improved by an efficient boundary refinement
post-processing, such as the narrow-band level-set based
curve/surface evolution [?], [?], or the learned intensity
model based graph-cut [?]. Further examination into the
sub-connectivity processes for the pancreas segmentation framework
that considers the spatial relationships of the splenic, portal
and superior mesenteric veins with the pancreas may be needed
in future work.